fix: load WASM grammars sequentially to avoid Node 20+ race condition #40
Merged
colbymchenry merged 1 commit on Feb 19, 2026
web-tree-sitter has a known race condition when loading multiple WASM grammars concurrently on Node.js 19+ (V8 10.8+). External scanner symbols from one grammar can overwrite another's GOT entries, causing "bad export type" errors for TypeScript, TSX, and other languages. Replace Promise.allSettled(entries.map(...)) with a sequential for...of loop so each grammar fully initializes before the next one starts. Ref: tree-sitter/tree-sitter#2338
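The change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `initGrammars()` code: `loadSequentially`, `fakeLoader`, and `demo` are hypothetical names, and the fake loaders stand in for real grammar loads so the sketch runs without any WASM files. It keeps the per-entry failure tolerance of `Promise.allSettled` while making sure only one load is in flight at a time.

```typescript
// A loader starts an async load when called. In the PR's setting this would
// wrap a grammar load; here it is only a stand-in (assumption).
type Loader<T> = () => Promise<T>;

type Settled<T> =
  | { name: string; status: "fulfilled"; value: T }
  | { name: string; status: "rejected"; reason: unknown };

// Run the loaders one at a time. A failing entry is recorded and the loop
// continues, so one bad grammar does not block the rest.
async function loadSequentially<T>(
  entries: ReadonlyArray<readonly [string, Loader<T>]>,
): Promise<Settled<T>[]> {
  const results: Settled<T>[] = [];
  for (const [name, load] of entries) {
    try {
      const value = await load(); // the next loader starts only after this settles
      results.push({ name, status: "fulfilled", value });
    } catch (reason) {
      results.push({ name, status: "rejected", reason });
    }
  }
  return results;
}

// Bookkeeping used by the fake loaders to observe overlap.
let inFlight = 0;
let maxInFlight = 0;

// Hypothetical fake loader: counts how many loads overlap, then succeeds or fails.
function fakeLoader(name: string, fail = false): Loader<string> {
  return async () => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await Promise.resolve(); // yield once, as a real async load would
    inFlight -= 1;
    if (fail) throw new Error(`Failed to load ${name}`);
    return `${name}-grammar`;
  };
}

async function demo(): Promise<{ results: Settled<string>[]; maxInFlight: number }> {
  const entries: Array<readonly [string, Loader<string>]> = [
    ["typescript", fakeLoader("typescript")],
    ["tsx", fakeLoader("tsx", true)],
    ["python", fakeLoader("python")],
  ];
  const results = await loadSequentially(entries);
  return { results, maxInFlight };
}
```

The trade-off is startup latency: total load time becomes the sum of the individual loads rather than the slowest one, which the PR accepts in exchange for correct initialization.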
andreinknv added a commit to andreinknv/codegraph that referenced this pull request on May 18, 2026
Final wave of the codegraph tool-audit friction sweep.

Polish:
- colbymchenry#25 compare_to_ref now reports body-only-edited files explicitly.
- colbymchenry#26 compare_to_ref includeEdges renders symbol names (not raw IDs), filters self-edges, drops empty file headers.
- colbymchenry#27 codegraph_coverage gains sources (list) + drop modes; the two audit-residue coverage sources were removed from the index.
- colbymchenry#28 role classifier ROLE_LIST_TEXT requires structural route/handler evidence; stops api_endpoint over-assignment from docstrings.
- colbymchenry#29 status topBiomarkers emits an explicit clean/0-findings line.
- colbymchenry#30 discover skips test-fixture indices (FIXTURE_DIR_NAMES).
- colbymchenry#31 CLI reload-modules warns it has no lasting effect (ephemeral).
- colbymchenry#32 session/note CLI --limit defaults aligned to MCP (20 / 50).
- colbymchenry#33 CLI ask renders the verified-citations block (shared buildCitationReport helper, reused from the MCP path).
- colbymchenry#34 codegraph_session gains a delete action + session delete CLI subcommand + deleteSession query helper.
- colbymchenry#40 fuzzy-fallback banner extended to coverage + role symbol modes.

Docs:
- colbymchenry#35 find intent-mode hint references codegraph_graph, not the removed codegraph_callees / codegraph_walk.
- colbymchenry#36 dead_code via=rule footer recommends via=llm.
- colbymchenry#37 sql read-only rejection message names both the MCP and CLI schema-flag forms (surface-neutral).
- colbymchenry#38 serve --no-write-tools help names real write-class tools.
- colbymchenry#39 CLI local-chat help says "local LLM" to match MCP.

Reviewer APPROVE; info findings (CLAUDE.md CLI docs, JSDoc wording) addressed. Suite 3037 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbenhamd referenced this pull request in mbenhamd/codegraph on May 24, 2026
…rift/diff foundation) (#38)

* feat(PF-690): schema v6 + per-symbol fingerprint columns for duplicate/drift/diff infrastructure

First slice of the trace/duplicate/drift roadmap that Codex + agy debated in the design RFC. Pure data infrastructure: no new CLI/MCP surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates), and PR #41 (codegraph_explain) will consume these columns.

## What changed

- `src/extraction/fingerprints.ts` (new, ~245 lines): SHA-256 hashes computed from the in-memory tree-sitter subtree.
  - `astHash` (Type-1): normalized token stream with identifiers + literals preserved exactly. Detects "same code, only whitespace/comments differ".
  - `astShapeHash` (Type-2): identifier leaves in non-semantic positions replaced by `_ID`. Detects "same code, renamed locals". Property/field/type identifiers preserved by type; member-access targets, callees, kwarg names, type names, import names preserved by parent-context check.
  - `sigHash`: SHA-256 of the signature string. Null when no signature was extracted.
  - Comment + whitespace stripped; trivia tokens (commas, semicolons, braces) excluded via `namedChild` walk.
- Schema v6 migration (`src/db/migrations.ts:93-119`, `src/db/schema.sql`): adds 4 nullable columns to `nodes` table: `ast_hash`, `ast_shape_hash`, `sig_hash`, `call_pattern_hash`. Partial indexes on the two body hashes (`WHERE NOT NULL`) so duplicate-detection sweeps are O(log N) lookups instead of full scans. `callPatternHash` is reserved for post-resolution population by a later PR.
- Extraction wiring (`src/extraction/tree-sitter.ts:425-460`): computes the three body hashes from the already-parsed tree-sitter subtree inside `createNode`. agy's RFC point (that the AST is already in memory and hashing is microseconds) verified on a 107-file codegraph src/ corpus: 2.4s with vs 2.9s without (overhead below run-to-run variance, well under Codex's ≤15% budget).
- `Node` interface (`src/types.ts:165-198`): adds nullable `astHash`, `astShapeHash`, `sigHash`, `callPatternHash` fields with provenance docstrings.
- `queries.ts` insertNode / updateNode / rowToNode: round-trip the fingerprint columns nullably so framework-extractor synthesized route nodes (no body) keep `null` fingerprints; downstream consumers filter with `WHERE ast_hash IS NOT NULL`.

## v1 contract (Council RFC, locked by tests)

- Detects: Type-1 (whitespace/comment-insensitive clones), Type-2 (renamed-locals clones).
- Does NOT detect: Type-3 (statement reorder), Type-4 (semantic equivalence).
- Literal values preserved → security/config code where the literal matters does not falsely conflate. Strongest counterpoint the council named ("miss literal-only differs") explicitly accepted.

## Bug-pin verified during review

Codex pass 1 caught a real BLOCKER: `tree-sitter-python` parses `obj.start()` as `attribute(identifier "obj", identifier "start")` (both children are plain `identifier`), so a type-only rename rule would have conflated `obj.start()` with `obj.stop()`. Fix: parent-context check (`shouldPreserveIdentifier`) preserves identifiers in semantic positions: `attribute` children, `call.function` field, `keyword_argument` children, types, imports. Codex round 2 caught a follow-on: Python kwargs (`g(start=1)`); added `keyword_argument` to the semantic-parent set.

## Tests (`__tests__/fingerprints.test.ts`, 14 cases)

- sigHash determinism + null on missing signature.
- Determinism: same input → same hex.
- Type-1: whitespace/comment edits preserve astHash.
- Type-2: renamed locals share astShapeHash, NOT astHash.
- Member rename diverges (TS `property_identifier` path).
- Literal change diverges (security sensitivity pinned).
- Control-flow reorder diverges (Type-3 NOT detected, pinned).
- Python regression: `obj.start()` vs `obj.stop()` diverge (member preserved despite both being `identifier`).
- Python bare callee: `start()` vs `stop()` diverge.
- Python kwarg: `g(start=1)` vs `g(stop=1)` diverge.
- Python param rename: same astShapeHash, different astHash.
- Cross-language: TS body ≠ Python body even when semantically equivalent.

## Reviewer trail

- Codex pass 1: 1 BLOCKER (Python member conflation) + 1 REVIEW (missing Python tests) + 1 NITPICK (stale comment).
- Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python kwarg conflation) + NITPICK (header repeated stale claim).
- Codex round 3: Both round-2 findings CLOSED. Last NITPICK (kwarg comment misrepresented the over-preservation trade-off). Codex authorized "iterate for the comment fix, then ship".
- Doc comment now accurately describes the trade-off: kwarg set-membership preserves any direct identifier leaf including value-side identifiers; tighter field-specific check deferred to a follow-up.

## Verification

- tsc --noEmit clean
- npm test: 1026 passed | 2 skipped (was 1012 on main; +14 fingerprint tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs main baseline)
- Index-time delta on 107-file corpus: 2.4s with vs 2.9s without, below run-to-run variance, well under ≤15% target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PF-690): cross-language member preservation + idempotent v6 migration (Codex round 4)

Codex round 4 deep sweep verified via real `tree-sitter-wasms` parses that the v1 fingerprint rule conflated semantically different code in Ruby, Java, C#, and Rust. The original rule only handled Python's `attribute` shape + the `call.function` field. Each of these four languages emits plain `identifier` for member/callee positions but under DIFFERENT parent node types:

- Ruby: `user.start` -> call(identifier "user", identifier "start"), method field carries the member name (not `function`).
- Java: `obj.start()` -> method_invocation(identifier, identifier).
- C#: `obj.Start()` -> invocation_expression > member_access_expression(identifier, identifier).
- Rust: `Router::new()` -> call_expression > scoped_identifier(identifier "Router", identifier "new").

Fix: extend `SEMANTIC_PARENT_TYPES` with `method_invocation`, `member_access_expression`, `invocation_expression`, `scoped_identifier`, `scoped_call_expression`, `field_expression`. Add `call.method` field check to `shouldPreserveIdentifier` to cover Ruby's dual-purpose `call` type. Same set-membership v1 trade-off applies (accepts false negative on receiver names rather than risk semantic-name conflation).

Plus Codex round 4 REVIEW: migration v6 was not idempotent under concurrent-open race. Two processes hitting a v5 database could both read version 5, both enter migration, and the second's `ALTER TABLE ADD COLUMN` would fail with duplicate-column even though the resulting schema is fine. Fixed via `PRAGMA table_info` pre-check per column so already-applied additions become no-ops. `CREATE INDEX IF NOT EXISTS` was already idempotent.

Tests added (4 cross-language regressions):

- Ruby `user.start` vs `user.stop` -> different astShapeHash
- Java `obj.start()` vs `obj.stop()` -> different astShapeHash
- C# `obj.Start()` vs `obj.Stop()` -> different astShapeHash
- Rust `Router::new()` vs `Router::default()` -> different astShapeHash

Each pins the specific cross-language failure mode Codex verified.

Reviewer trail:

- Codex round 4 (deep sweep, 6 attack vectors): found 1 BLOCKER (cross-language conflation) + 1 REVIEW (migration race). Both fixed; remaining 4 vectors confirmed clean (hash determinism, persistence completeness, ERROR/MISSING handling, createNode hook coverage).
- CodeRabbit CLI: ran against the same diff, no findings.
- Claude Explore subagent: returned 7 findings; 3 already covered here (cross-language tests, migration safety), 4 deferred as documentation/contract clarifications (line-ending CRLF normalization, downstream kwarg trade-off doc, callPatternHash contract clarity, SQLite version compat: node:sqlite ships SQLite 3.42+ which fully supports partial indexes).

Verification:

- tsc --noEmit clean
- npm test: 1030 passed | 2 skipped (was 1026 last commit; +4 cross-language tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00 fp=0 (no regression vs baseline)
- All 18 fingerprint tests pass deterministically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
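The per-column pre-check mentioned in the second commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/db/migrations.ts`: `MigrationDb`, `FakeDb`, `addColumnIfMissing`, and `applyFingerprintColumns` are hypothetical names, the `TEXT` column type is a guess, and the in-memory fake stands in for a real SQLite handle (where `columnNames` would be backed by `PRAGMA table_info` and `addColumn` by `ALTER TABLE ... ADD COLUMN`).

```typescript
// Hypothetical minimal database surface; not a real driver's API.
interface MigrationDb {
  columnNames(table: string): string[]; // e.g. read via PRAGMA table_info(<table>)
  addColumn(table: string, column: string, sqlType: string): void; // e.g. ALTER TABLE ... ADD COLUMN
}

// Add the column only when it is not already present, so re-running the
// migration against an already-migrated schema becomes a no-op.
function addColumnIfMissing(
  db: MigrationDb,
  table: string,
  column: string,
  sqlType: string,
): boolean {
  if (db.columnNames(table).indexOf(column) !== -1) return false;
  db.addColumn(table, column, sqlType);
  return true;
}

// Column names come from the commit message; the TEXT type is an assumption.
const FINGERPRINT_COLUMNS = ["ast_hash", "ast_shape_hash", "sig_hash", "call_pattern_hash"];

// Returns how many columns were actually added on this run.
function applyFingerprintColumns(db: MigrationDb): number {
  let added = 0;
  for (const column of FINGERPRINT_COLUMNS) {
    if (addColumnIfMissing(db, "nodes", column, "TEXT")) added += 1;
  }
  return added;
}

// In-memory stand-in with a single `nodes` table that, like SQLite,
// rejects adding a column that already exists.
class FakeDb implements MigrationDb {
  private readonly columns: string[] = ["id", "name"];

  columnNames(_table: string): string[] {
    return this.columns.slice();
  }

  addColumn(_table: string, column: string, _sqlType: string): void {
    if (this.columns.indexOf(column) !== -1) {
      throw new Error(`duplicate column name: ${column}`);
    }
    this.columns.push(column);
  }
}
```

Running `applyFingerprintColumns` twice against the same `FakeDb` adds four columns on the first call and none on the second, without hitting the duplicate-column error.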
Summary
- Replace `Promise.allSettled(entries.map(...))` with a sequential `for...of` loop in `initGrammars()` to avoid a known `web-tree-sitter` WASM race condition on Node.js 19+/20+
- Fixes `bad export type for 'tree_sitter_tsx_external_scanner_create': undefined`, which caused those languages to silently fail to index

Root Cause
`web-tree-sitter` WASM instantiation is not safe for concurrent `Language.load()` calls on Node.js 19+ (V8 10.8+). The external scanner symbols from one grammar can collide with another's during parallel initialization.

Documented upstream:
- tree-sitter/tree-sitter#2338: Error: bad export type for tree_sitter_tsx_external_scanner_create: undefined
- tree-sitter/tree-sitter-typescript#244: tree_sitter_typescript_external_scanner_create: undefined

Test plan
- No `Failed to load` warnings in output after the change